A Data Mining Approach for Detecting Collusion in Unproctored Online Exams

J. Langerbein(1), T. Massing(1), J. Klenke(1), M. Striewe(1), M. Goedicke(1), C. Hanck(1), N. Reckmann(1)


(1) Chair of Econometrics, University of Duisburg-Essen, Germany

Introduction

During the COVID-19 pandemic, universities had to move classes and exams online, while privacy regulations limited the use of proctoring. Because illicit collaboration during online exams is therefore hard to prevent, we propose a data-driven method to identify potential collusion among students after the exam. The method uses an alternative distance measure and hierarchical clustering algorithms to find groups of students with remarkably similar exam results. It builds upon previous research that used exam event logs to detect collusion. Additionally, we present an approach to categorize groups as “outstandingly similar” using a proctored comparison group. Our methodology and results are detailed in the following sections.

Methodology

The study uses data from the Descriptive Statistics course at the University of Duisburg-Essen, Germany. It compares a test group, which took an unproctored exam at home during the COVID-19 pandemic, with a comparison group that took a proctored exam in class prior to the pandemic. Both groups’ exams comprised arithmetical problems, R programming tasks, and a short essay task. During the exams, students’ activities and time stamps were recorded in event logs, and the points achieved per task were noted. The dataset was cleaned to remove students with minimal participation or achievement and those who experienced internet issues. Despite the differing exam formats, the two groups are comparable because the exams shared similar content and learning goals.

The agglomerative (bottom-up) hierarchical clustering algorithm is based on the following dissimilarity measure:

\[D(x_i, x_{i'}) = \frac{1}{h} \sum_{j=1}^h w_j \cdot d_j(x_{ij}, x_{i'j})\]

Here, \(D(x_i, x_{i'})\) is the global pairwise dissimilarity, while \(d_j(x_{ij}, x_{i'j})\) denotes the pairwise attribute dissimilarity. The weights \(w_j\) sum to 1. The index \(i = 1, \ldots, N\) runs over the \(N = 151\) students, while \(j = 1, \ldots, h\) indexes the \(h\) attributes.

We compared two kinds of attributes: the dissimilarities in the students’ event patterns (time of submission) and the dissimilarities in points achieved.
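To make the global dissimilarity concrete, the following Python sketch evaluates \(D(x_i, x_{i'})\) for a small, made-up matrix of points per task. The attribute dissimilarity \(d_j\) is not specified in this abstract, so the sketch assumes a range-scaled absolute difference; the function name `global_dissimilarity` and the toy data are likewise hypothetical and serve illustration only.

```python
import numpy as np

def global_dissimilarity(X, weights):
    """Return the N x N matrix D with D[i, k] = (1/h) * sum_j w_j * d_j(x_ij, x_kj).

    Assumption (not from the paper): d_j is the absolute difference
    between two students on attribute j, divided by the range of attribute j.
    """
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n, h = X.shape
    if weights.shape != (h,) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must have length h and sum to 1")
    # Range of each attribute; constant attributes get range 1 to avoid division by zero.
    value_range = X.max(axis=0) - X.min(axis=0)
    value_range[value_range == 0] = 1.0
    # d[i, k, j] = |x_ij - x_kj| / range_j
    d = np.abs(X[:, None, :] - X[None, :, :]) / value_range
    # Weighted sum over attributes, then the 1/h factor from the formula.
    return (d * weights).sum(axis=2) / h

# Toy data: points of three students on two tasks (made up for illustration).
points = [[10, 5],
          [10, 5],
          [0, 0]]
D = global_dissimilarity(points, weights=[0.5, 0.5])
print(D[0, 1])  # 0.0, students 0 and 1 have identical points
print(D[0, 2])  # 0.5
```

The same function could be applied to a matrix of submission times per task to obtain the event-pattern dissimilarity, again under the assumed form of \(d_j\).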

Empirical results

Discussion

In our discussion, we interpret the results of the hierarchical clustering algorithms, which are visualized as a dendrogram, a tree-like structure. After comparing various algorithms, we find average linkage clustering to be the most fitting for our analysis. It helps us identify compact clusters, notably clusters A, B, and E, which suggests that no collusion took place in larger groups. Additional visual tools such as scatterplots and bar charts aid in examining student similarities within these clusters. The comparison with a reference group supports the effectiveness of our method in detecting collusion, but limitations exist because the ground truth is unknown. Despite this, our approach not only helps deter cheating in unproctored exams but also contributes to the broader digital transformation of education and to preparing for unforeseen future challenges similar to the COVID-19 pandemic.
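As an illustration of average linkage clustering on a precomputed dissimilarity matrix, the following Python sketch uses SciPy. The 4 x 4 matrix and the cut height of 0.2 are made-up values chosen for the example. They are not taken from the study, and the sketch is not the implementation used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical global dissimilarities between four students (symmetric, zero diagonal).
D = np.array([
    [0.00, 0.05, 0.60, 0.70],
    [0.05, 0.00, 0.65, 0.75],
    [0.60, 0.65, 0.00, 0.10],
    [0.70, 0.75, 0.10, 0.00],
])

# linkage expects the condensed (upper-triangular) form of the matrix.
condensed = squareform(D, checks=True)

# Average linkage: the distance between two clusters is the mean of all
# pairwise dissimilarities between their members.
Z = linkage(condensed, method="average")

# Cut the tree at an assumed height of 0.2 to obtain flat cluster labels.
labels = fcluster(Z, t=0.2, criterion="distance")
print(labels)
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree. In practice, the cut height would have to be justified, for example by reference to the dissimilarities observed in the proctored comparison group.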
